Brian S. Evans, Ph.D.
Migratory Bird Center
Smithsonian Conservation Biology Institute
Brian S. Evans, Ph.D.
Migratory Bird Center
Smithsonian Conservation Biology Institute
In this lesson we will explore for loops and functions. A for loop is code that allows you to repeat an operation a set number of iterations. A function is code that allows you to create or modify objects in a repeatable way. As data analysis often includes many repeated tasks, for loops and user-defined functions can save considerable time, shorten R scripts, and make your scripts more legible. Understanding how they work is crucial to developing good coding practices.
For this lesson, we will use three example data frames: Iris sepal and petal measurements collected by botanist Edgar Anderson in the 1920’s, bird traits collected by Wilman et al. (2014), and bird point count observations from Washington, DC (Evans et al. 2016).
Open R Studio and run the following lines of code to load the data into R:
#=================================================================================*
# ---- set up ----
#=================================================================================*
# Load libraries
library(RCurl)
library(tidyverse)
# Note: If you have yet to install these library, please do so with:
# install.packages('rCurl') ; install.packages('tidyverse')
# Provide the web addresses of the files (note: iris is preloaded example data):
url <- 'https://raw.githubusercontent.com/bsevansunc/workshop_languageOfR/master'
habitsURL <- getURL(
paste(url, 'birdHabits.csv', sep = '/')
)
countsURL <- getURL(
paste(url, 'birdCounts.csv', sep = '/')
)
# Read in the data:
birdHabits <- tbl_df(
read.csv(text = habitsURL, stringsAsFactors = FALSE)
)
birdCounts <- tbl_df(
read.csv(text = countsURL, stringsAsFactors = FALSE)
)For the iris dataset, I think it’s best to do some familiar cleaning steps:
# Clean up iris for analysis:
irisTbl <- tbl_df(iris)
names(irisTbl) <-
c('sepalLength',
'sepalWidth',
'petalLength',
'petalWidth',
'species')In the previous lessons, we worked with several functions (e.g., “c”, “mean”). Functions allow you to simplify complex tasks, which is especially useful if you need to run a task multiple times. Program R contains many functions and many more still are created by R users and provided to the R community as collections of functions known as packages (or libraries). As datasets and data handling needs are often distinct, relying exclusively on built-in and community-defined functions is limiting.
Why would you write your own functions? Learning how to create your own functions, or customize existing functions, provides you with the flexibility to solve unique problems, shorten your script, and make your R code more legible for others. Writing your own functions is easy, as long as you follow the correct syntax. The basic structure is (DO NOT RUN):
functionName <- function(functionTarget) {
functionBody: What you would like to happen when you run your function
}
Let’s illustrate writing functions with a pretty silly example – a function that will add 1 to any value. Below, we assign a name to the function, addOneFun <-, tell R that the code in the enclosing curly brackets is a function of some object x (function(x)), and provide the argument the function will be evaluated and returned (x + 1).
# First function:
addOneFun <- function(x){
x+1
}
# Testing the function on a numeric value:
42+1
addOneFun(42)
# Testing the function on a vector of numeric values:
v <- c(1,1,2,3,5)
v + 1
addOneFun(v)
If you are doing a lot of data querying, writing your own query functions can be a great way to make your code more legible. Let’s use the birdCounts data to explore writing custom query functions. For example, perhaps you frequently need to query the number of individuals of a given species at each site:
# Query by species function:
speciesSubset <- function(spp){
birdCounts[birdCounts$species == spp, ]
}
# Test function:
birdCounts[birdCounts$species == 'grca', ]
speciesSubset('grca')
It’s best to write your functions as general as possible. For example, the species subset function above is great if you are going to query just the birdCounts data frame. If you also intend to query the birdHabits frame by species, you may want to make the data frame one of the arguments.
# Query by species function, generalized:
speciesSubset <- function(dfIn, spp){
dfIn[dfIn$species == spp, ]
}
# Test function, birdCounts:
birdCounts[birdCounts$species == 'grca', ]
speciesSubset(birdCounts, 'grca')
# Test function, birdHabits:
birdHabits[birdHabits$species == 'grca', ]
speciesSubset(birdHabits, 'grca')Another way to generalize a query function is to make the query field one of the arguments of the function. Note that the below statements are equivalent:
# Subset to catbirds using $ and matrix notation:
birdHabits[birdHabits$species == 'grca', ]
birdHabits[birdHabits[,'species'] == 'grca',]Because of this, we can write a VERY general function to query any data frame!
# Very generalized query:
query <- function(dfIn, variable, condition){
dfIn[dfIn[,variable] == condition,]
}
# Test query:
birdHabits[birdHabits$species == 'grca', ]
birdHabits[birdHabits[,'species'] == 'grca',]
query(birdHabits, 'species', 'grca')Exercise One:
- Write a function to subset the
birdHabitsdata frame to just ground foraging birds.- Write a function that will calculate the total number of individuals observed of a given species in the
birdCountsdata frame(i.e., the sum of count for a data frame subset by species)
Functions can sometimes get pretty long and include multiple objects. When this is the case, it’s good practice to include return to illustrate which object is being returned. Let’s write a function in multiple steps that calculates the average (mean) number of individuals of a given species observed on a point count.
# Query function, mean count:
meanSpeciesCounts <- function(spp){
# Number of unique site values:
nSites <- length(unique(birdCounts$site))
# Subset birdCounts to the species of interest:
birdCounts_sppSubset <- birdCounts[birdCounts$species == spp, ]
# Calculate the total number of birds observed:
nBirds <- sum(birdCounts_sppSubset$count)
# Return mean number of birds per site:
return(nBirds/nSites)
}
# What is the average number of observed catbirds?
meanSpeciesCounts('grca')Note in the above that the objects nSites, birdCounts_sppSubset, and nBirds were not placed in your global environment. Objects defined within the function body are only defined inside the function, not the global environment.
Functions can also be nested. In other words, functions can include user-defined functions written inside the function body. We have already done this quite a bit with functions in base R. For example, the length function above is nested inside our user-defined function meanSpeciesCounts. Let’s try to nest our speciesSubset function inside the meanSpeciesCounts function above.
# Query by species function, generalized:
meanSpeciesCounts <- function(dfIn, spp){
# Number of unique site values:
nSites <- length(unique(dfIn$site))
# Calculate the total number of birds observed:
nBirds <- sum(speciesSubset(dfIn, spp)$count)
# Return mean number of birds per site:
return(nBirds/nSites)
}
# What is the average number of observed catbirds?
meanSpeciesCounts(birdCounts, 'grca')Exercise Two:
- Using the
birdHabitsdata frame, write a function to count the number of species in a given diet and foraging guild.- The mathematical formula for standard error is provided below. Convert this to an R function (Note: the function for standard deviation is
\[StdErr (x) = \frac{StDev(x)}{\sqrt{n}}\]sdand the function for square root issqrt):
Why would you use for loops? Let’s look at a common example. You have been asked to calculate the mean petal length of three Iris species: Iris setosa, Iris versicolor, and Iris virginica. (coded as the factor levels setosa, versicolor, and virginica in the species field of the irisTbl data frame). Using the mean function and tools that we have addressed thus far, use the following to calculate the mean petal length for each species:
# Filter irisTbl to setosa:
irisTbl[irisTbl$species == 'setosa', ]
# Extract the petalLength field (column):
irisTbl[irisTbl$species == 'setosa', ]$petalLength
# Calculate the mean of petal lengths:
mean(irisTbl[irisTbl$species == 'setosa', ]$petalLength)Exercise Three:
Calculate the mean petal length of each of the Iris species using matrix notation (as above) and a custom function.
The code you generated above is very repetitive – the only change in each of the lines you should have created above was the name of the species. Writing code like this makes your scripts unnecessarily long and prone to user input error. For loops should be used whenever a chunk of code is repeated more than twice.
We’ll return to the Iris example a bit later, but for now let’s start with a very simple for loop.
Let’s start our exploration of for loops by constructing a vector, v using a set of five numbers.
# Generate vector v:
v <- c(1,1,2,3,5)
v
We would like to modify the values in vector v by adding one to each value. This might be written mathematically as:
To do so, we will write a for loop. Writing proper for loops requires following these three steps:
The key to writing for loops is having a clear understanding of indexing. Recall that value v[i] is equal to the value at position i in vector v. Let’s take a look at the value of v at position 3:
# Explore vector v using indexing:
i <- 3
v[i]
v[3]
v[3] == v[i]
If we want to calculate v + 1 at position 3, we would write:
# Add 1 to the value of v at position three:
v[3] + 1
v[i] + 1For loop development begins by defining an output object of a given length that your for loop will write to. This step is very important – your for loop will run much, much slower if you do not do so!
Vector objects are defined as follows:
# Define a vector for output:
vNew <- vector('numeric', length = length(v))
str(vNew)
The first argument of the vector function is the type of object you would like to create. The next argument sets the length of the created object to be the equivalent to that of v.
Values of vector vNew will be replaced location-by-location. For example, let’s compare the initial value of v with the resultant value of v + 1 at position 3:
# Explore filling values of vNew by index:
i <- 3
v[i]
vNew[i] <- v[i] + 1
vNew[i]
v[i] + 1 == vNew[i]
The utility of the for loop is that we can calculate the above for each position (i) in vector v. This is done by setting the “for loop sequence” statement which defines the locations over which the for loop will be calculated. The for loop sequence for locations one through five is written as (DO NOT RUN):
# for(i in 1:5)The above can be translated as “for position i in positions one through five”.
To make our code as flexible as possible, we generally do not want to have to directly type in the stopping point of the for loop. We can use 1:length(v) or the function seq_along to specify the for loop address. Run the following and note the behavior:
v
1:5
1:length(v)
seq_along(v)The for loop sequence statement can then be written as (DO NOT RUN):
# Example for loop sequence statements:
# for(i in 1:length(v))
# for(i in seq_along(v))The above statements are more flexible than providing a numeric stopping point for your for loop.
The for loop sequence statement is followed by the “body” statement that tells R what to do during each iteration of the loop. The body associated with our “add one” formula is (DO NOT RUN):
# vNew[i] <- v[i] + 1If your for loop spans multiple lines, place the body within curly brackets, for example (DO NOT RUN):
# for(i in 1:length(x)){
# body
# }Putting together our steps of: 1) Creating an output object, 2) The for loop sequence statement, and 3) The body statement, our completed for loop is written as follows (RUN THIS ONE):
# First for loop:
vNew <- numeric(length = length(v))
for(i in seq_along(v)){
vNew[i] <- v[i] + 1
}
Take a look at the output and compare it with the values of v:
# Explore first for loop output:
vNew
v
vNew == vWe specify the length of the vector to provide R with stopping rules – without this for loops can become very memory hungry when running over large datasets
The following two sections of code are equivalent, but the latter much easier to read. As writing code is both a tool and a method of communication, you should ensure that your code is as readable as possible.
for(i in 1:length(v)) vNew[i] <- v[i] + 1
for(i in 1:length(v)){
vNew[i] <- v[i] + 1
}When writing for loops, it is necessary to ensure that the loop is doing what you expect it to do. A simple way to ensure that this is the case is to specify i and run the instructions on just that value. For example, to observe the behavior of the for loop at position 3:
i = 3
vNew[i] <- v[i] + 1
vNew[i]
v[i]Exercise Four:
- Use a for loop to test whether each species of bird in the
birdHabitsdata frame is an omnivore (logical test).- Calculate the total number of records associated with omnivorous birds.
You may have noticed that none of the operations we completed in the previous section actually required for loops (for example, v + 1 is calculated for each value of v by default). For loops are predominantly used to split-apply-combine data. In data science, split-apply-combine problems are those that relate to situations in which you seek to split a dataset into multiple parts, apply a function to each part, and combine the resulting parts. For loops can be a great tool for split-apply-combine problems.
We will summarize the iris data frame to illustrate using a for loop for a typical split-apply-combine problem. Calculating the mean for each species without a for loop required the following code:
# Mean petal lengths of Iris species without a for loop:
mean(irisTbl[irisTbl$species == 'setosa', ]$petalLength)
mean(irisTbl[irisTbl$species == 'versicolor', ]$petalLength)
mean(irisTbl[irisTbl$species == 'virginica', ]$petalLength)We can use a for loop to calculate this value across species. To do so, we first need to create a vector of species:
# Make a vector of species to loop across:
irisSpecies <- levels(irisTbl$species)
irisSpeciesNext, we’ll create an empty vector to store our output:
# For loop output statement:
petalLengths <- vector('numeric',length = length(irisSpecies))
petalLengthsThe vector of Iris species will define the bounds of our for loop sequence (DO NOT RUN):
# For loop sequence:
# for(i in seq_along(irisSpecies))Split: To construct the for loop body, we’ll start by splitting the data:
# Exploring the iris data, subsetting by species:
i <- 3
irisSpecies[i]
irisTbl[irisTbl$species == irisSpecies[i], ]
# Split:
iris_sppSubset <- irisTbl[irisTbl$species == irisSpecies[i], ]Apply: Modification of the data:
# Calculate mean petal length of each subset:
mean(iris_sppSubset$petalLength)The completed for loop is constructed as:
# Make a vector of species to loop across:
irisSpecies <- levels(irisTbl$species)
# For loop output statement:
petalLengths <- vector('numeric',length = length(irisSpecies))
# For loop:
for(i in seq_along(irisSpecies)){
iris_sppSubset <- irisTbl[irisTbl$species == irisSpecies[i], ]
petalLengths[i] <- mean(iris_sppSubset$petalLength)
}
Combine: Combining the for loop output can be done in a number of ways. Below, we bind our results in a tidy tibble data frame, using the data_frame function.
# Make a tibble data frame of the for loop output:
petalLengthFrame <- data_frame(species = irisSpecies, count = petalLengths)
petalLengthFrameExercise Five:
Use a for loop and the
birdHabitsdata frame to calculate the number species in each diet guild.
For loops can be used to perform powerful data queries. To illustrate how this works, we will use birdCounts data frame and calculate the total number of omnivorous birds that were observed at each of the sites. Take a moment to explore the data:
# Explore the bird count data:
head(birdCounts)
str(birdCounts)
# Explore the bird trait data:
head(birdHabits)
str(birdHabits)
Our first goal will be to get a vector of birds that are ground foragers from the birdHabits data frame:
# Extract vector of omnivorous species:
omnivores <- birdHabits[birdHabits$diet == 'omnivore',]$speciesSplit: To evaluate the number of omnivores per site, we will split the data into individual sites. To do so, we generate a vector of unique sites and query the data frame by the site at a given position in the site vector.
# Generate a vector of unique sites:
sites <- unique(birdCounts$site)
# Site at position i:
i <- 3
sites[i]
# Subset data:
birdCounts_siteSubset <- birdCounts[birdCounts$site == sites[i],]
birdCounts_siteSubset
Apply: We then apply a function to each individual part. In this case, we will use %in% to extract only records associated with omnivores and sum the count field.
# Just a vector of omnivore counts:
countVector <- birdCounts_siteSubset[birdCounts_siteSubset$species %in% omnivores, ]$count
# Get total number of omnivores at the site:
nOmnivores <- sum(countVector)
Combine: Output may be combined together as we have with previous for loop statements in this lesson. In the code below, I have combined the sites and nOmnivore vectors into a tibble data frame:
# Generate a vector of unique sites:
sites <- unique(birdCounts$site)
outVector <- vector('numeric', length = length(unique(sites)))
for(i in seq_along(sites)){
birdCounts_siteSubset <- birdCounts[birdCounts$site == sites[i],]
countVector <- birdCounts_siteSubset[birdCounts_siteSubset$species %in% omnivores, ]$count
outVector[i] <- sum(countVector)
}
# Combine:
data_frame(site = sites, nOmnivores = outVector)
An alternative to combining vectors that I often find useful is combining a list of data frames using the tidyverse function bind_rows:
# Generate a vector of unique sites:
sites <- unique(birdCounts$site)
outList <- vector('list', length = length(unique(sites)))
for(i in seq_along(sites)){
birdCounts_siteSubset <- birdCounts[birdCounts$site == sites[i],]
countVector <- birdCounts_siteSubset[birdCounts_siteSubset$species %in% omnivores, ]$count
outList[[i]] <- data_frame(site = sites[i], nOmnivores = sum(countVector))
}
# Combine:
bind_rows(outList)We may also be interested in using a for loop to generate a vector of numbers based on some mathematical operation. For example, consider we have a value, 10, and want to calculate the mathematical expression:
\[n_t = 2(n_{t-1})\]Let’s start by creating a vector of length 5 for our output:
# For loop output statement:
n <- vector('numeric', length = 5)
n
We must first start with a “seed” value for our vector. This is the initial value of vector n.
# Setting the seed value:
n[1] <- 10
nLet’s calculate the first five values in this series. Because our for loop starts with our seed value, we are only interested in the second through fifth positions of this vector. Thus, our for loop sequence statement is (DO NOT RUN):
# For loop sequence:
# for(i in 2:length(n))For each iteration, the following body statement will be evaluated (the example is at position 2):
# Exploring the construction of the for loop body:
i <- 2
n[i]
n[i-1]
n[i] <- 2*n[i-1]
nAnd our complete for loop statement becomes:
# Output:
n <- vector('numeric', length = 5)
# Seed:
n <- 10
# For loop:
for(i in 2:5){
n[i] = n*v[i-1]
}Exercise Six:
One of my favorite for loops was created by Leonardo Bonacci (Fibonacci). He created the first known population model, from which the famous Fibonacci number series was created. He described a population (N) of rabbits at time t as the sum of the population at the previous time step plus the time step before that:
\[N_t = N_{t-1} + N_{t-2}\]
- Use the combine function to create a seed vector of two values, zero and one.
- Use the formula above and your seed vector to generate the first 20 numbers of the Fibonacci number sequence. Hint: The for loop initialization will begin at third position – it will NOT include all of 1:20!
You can use for loops and functions to examine fundamental theoretical models in ecology. (Only) If you’re feeling super comfortable with for loops and functions, try out these additional exercises:
The exponential population growth model describes the growth of a population without resource limitations. This model describes the size of a population (\(N_t\)) at a given time unit as a function of the per capita growth rate of the population (r) and the size of the population at time t. This model is summarized as:
The logistic population growth model (Verhulst) states that the growth of the population will be constrained by the population carrying capacity (K) as a function of resource limitations. This model is summarized as (Note: the remaining variables are as above):